library(tidyverse)
── Attaching core tidyverse packages ──────────── tidyverse 2.0.0 ──
✔ forcats   1.0.0     ✔ stringr   1.5.0
✔ lubridate 1.9.2     ✔ tibble    3.2.1
✔ purrr     1.0.1     ✔ tidyr     1.3.0
✔ readr     2.1.4     ── Conflicts ────────────────────────────── tidyverse_conflicts() ──
✖ tidyr::expand() masks Matrix::expand()
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
✖ tidyr::pack()   masks Matrix::pack()
✖ tidyr::unpack() masks Matrix::unpack()
ℹ Use the ]8;;http://conflicted.r-lib.org/conflicted package]8;; to force all conflicts to become errors

1 load & explore data

Load the housing_prices.csv data set and undertake an initial exploration of the data. You will find details on the data set on the relevant Kaggle page

housing_prices <- read_csv("data/housing_prices.csv")
Rows: 19675 Columns: 10── Column specification ───────────────────────────────────────────────────────────────────────────────────────────
Delimiter: ","
chr (1): ocean_proximity
dbl (9): longitude, latitude, housing_median_age, total_rooms, total_bedrooms, population, households, median_i...
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
glimpse(housing_prices)
Rows: 19,675
Columns: 10
$ longitude          <dbl> -122.23, -122.22, -122.24, -122.25, -122.25, -122.25, -122.25, -122.25, -122.26, -122.…
$ latitude           <dbl> 37.88, 37.86, 37.85, 37.85, 37.85, 37.85, 37.84, 37.84, 37.84, 37.84, 37.85, 37.85, 37…
$ housing_median_age <dbl> 41, 21, 52, 52, 52, 52, 52, 52, 42, 52, 52, 52, 52, 52, 52, 50, 52, 52, 50, 52, 40, 42…
$ total_rooms        <dbl> 880, 7099, 1467, 1274, 1627, 919, 2535, 3104, 2555, 3549, 2202, 3503, 2491, 696, 2643,…
$ total_bedrooms     <dbl> 129, 1106, 190, 235, 280, 213, 489, 687, 665, 707, 434, 752, 474, 191, 626, 283, 347, …
$ population         <dbl> 322, 2401, 496, 558, 565, 413, 1094, 1157, 1206, 1551, 910, 1504, 1098, 345, 1212, 697…
$ households         <dbl> 126, 1138, 177, 219, 259, 193, 514, 647, 595, 714, 402, 734, 468, 174, 620, 264, 331, …
$ median_income      <dbl> 8.3252, 8.3014, 7.2574, 5.6431, 3.8462, 4.0368, 3.6591, 3.1200, 2.0804, 3.6912, 3.2031…
$ median_house_value <dbl> 452600, 358500, 352100, 341300, 342200, 269700, 299200, 241400, 226700, 261100, 281500…
$ ocean_proximity    <chr> "NEAR BAY", "NEAR BAY", "NEAR BAY", "NEAR BAY", "NEAR BAY", "NEAR BAY", "NEAR BAY", "N…

Dataset about areas - population, households, features of these. One character column (ocean_proximity), rest are numeric: location, local population size and number of households, number of rooms and bedrooms in the population’s households, median age (of the population from 1990 census), median income, median house value.

housing_prices %>% 
  summarise(across(everything(),
                   ~sum(is.na(.x))))

Only var with missing values is total_bedrooms, 200 NAs out of the 19,675 observations. Tbc how to deal with these, if at all.

2 check for explanatory vars that correlate

We expect the total_rooms of houses to be strongly correlated with total_bedrooms. Use ggpairs() to investigate correlations between these two variables.

It’s overkill to look at all the vars, so just look at the two of interest:

housing_prices %>% 
  select(total_rooms, total_bedrooms) %>% 
  GGally::ggpairs()

They look strongly positively correlated, and corr = 0.934 so that indicates a very strong positive correlation. (And *** indicate this is statistically significant, i.e. we can reject the null hypothesis that total_bedrooms does not affect total_rooms; it is very unlikely this data is from a world where the null hypothesis is true.)

3 trim data

drop total_bedrooms from the dataset, and use only total_rooms going forward

Because these variables are related, we should only use one of these in our model.

housing_prices <- housing_prices %>% 
  select(-total_bedrooms)

total_bedrooms was the only var with missing values, so we also don’t need to worry about these anymore.

4 explore numeric predictors

We are interested in developing a regression model for the median_house_value of a house in terms of the possible predictor variables in the dataset. i. Use ggpairs() to investigate correlations between median_house_value and the predictors (this may take a while to run, don’t worry, make coffee or something).

GGally::ggpairs(housing_prices)

From these, it looks like there are otherexplanatory variables that are strongly related:

Taking these into account, and looking at the magnitude of the correlation coefficients that are significant (*s), I would consider including the following in a model, in order of steps:

  1. income
  2. location (lat, long) - look at lat below
  3. demand (something derived from pop, household, rooms) - look at rooms below
  1. age
  1. Perform further ggplot visualisations of any significant correlations you find.
# income
housing_prices %>% 
  ggplot(aes(x = median_income, y = median_house_value)) +
  geom_point(fill = "grey60", size = 0.5) +
  geom_smooth(method = lm)

cor(housing_prices$median_house_value, housing_prices$median_income)
[1] 0.6426108

Interpret: There is a strong positive correlation between median house value and median income of an area. 64% of variation in house value is explained by variation in income.

This could be a useful explanatory variable.

Note also there seems to be an upper limit in median_house_value, odd lack of values above 500,000.

housing_prices %>% 
  ggplot(aes(x = latitude, y = median_house_value)) +
  geom_point(fill = "grey60") +
  geom_smooth(method = lm)

This is not a linear relationship - ther appear to be clusters where house price is more variable (higher peaks) - i.e. certain latitudes, I suspect these are where cities are.

Do not use this as a variable on its own (consider using location though - maybe even encoding as different areas?)

# total_rooms
housing_prices %>% 
  ggplot(aes(x = total_rooms, y = median_house_value)) +
  geom_point(fill = "grey60") +
  geom_smooth(method = lm)

This does not look homoeskedastic, lots of data points at lower total_rooms, but with high degree of spread of house values. Suggest total_rooms might need to be standardised by some factor?

# median_age
housing_prices %>% 
  ggplot(aes(x = housing_median_age, y = median_house_value)) +
  geom_point(fill = "grey60") +
  geom_smooth(method = lm)

No obvious correlation here. Do not use in model.

5 explore categorical predictor

Shortly we may try a regression model to fit the categorical predictor ocean_proximity. Investigate the level of ocean_proximity predictors. How many dummy variables do you expect to get from it?

housing_prices %>% 
  distinct(ocean_proximity)

There are 5 distinct categories in ocean_proximity, so expect to generate 4 dummy variables. Set inland as the reference, everything else gets more proximal to ocean.

housing_prices_trim <- housing_prices %>% 
  # set up dummy variables for ocean_proximity
  fastDummies::dummy_cols(select_columns = "ocean_proximity",
                          remove_selected_columns = TRUE) %>% 
  janitor::clean_names() %>% 
  # set inland as reference
  select(-ocean_proximity_inland)
# check for any potential correlations
housing_prices %>% 
  ggplot(aes(x = ocean_proximity, y = median_house_value)) +
  geom_boxplot()

House prices on island look higher than the rest, this may be a sensible predictor to use in a model.

Update: from boxplot, actually looks like 3 categories that are similar: inland, island, and proximal to ocean (combining the 3 others) –> which would suggest 2 dummy variables (with inland set as reference).

housing_prices %>% 
  summarise(count = n(), .by = ocean_proximity)

~20,000 observations of which, 8,600 are <1h to ocean, 5 are on island, 2000 are near bay, 2446 are near ocean, rest are inland (~6,500).

There are only 5 observations for island, so not enough data here to model this category, and also dragging up house prices - suggest dropping this category altogether for now.

So let’s use inland as reference, drop island observations, and combine the rest into a predictor “ocean_proximal”:

housing_prices_ocean <- housing_prices %>% 
  filter(ocean_proximity != "ISLAND") %>% 
  mutate(ocean_proximal = if_else(
    ocean_proximity == "INLAND",
    FALSE, TRUE)) %>% 
  select(-ocean_proximity)

head(housing_prices_ocean)

homework review

In R, we don’t need to manually make the dummy variables - just keep the ocean_proximity column and pass this into the model below. Similar results, see the coefficient estimates for each type of ocean_proximity.

Example:

# iris dataset, 1 continuous, 1 categorical var
lm1 <- lm(Petal.Length ~ Petal.Width + Species, iris)

head(model.matrix(lm1))
  (Intercept) Petal.Width Speciesversicolor Speciesvirginica
1           1         0.2                 0                0
2           1         0.2                 0                0
3           1         0.2                 0                0
4           1         0.2                 0                0
5           1         0.2                 0                0
6           1         0.4                 0                0
# shows that the Species var are dummy encoding automatically:

Also: it’s ok to keep the island in with 5 observations; the difference in sample size is not a problem. It might mean we are less empowered to make predictions about island house values later, but it’s ok to include in the model. Also ok to combine similar groups.

If you include n dummy variables, you’ll get one of them with a line of NAs in summary(model) <- a good sign of multicolinearity and you need to remove 1 dummy var.

6 simple linear regression with income

Start with simple linear regression. Regress median_house_value on median_income and check the regression diagnostics.

model_income <- lm(formula = median_house_value ~ median_income, housing_prices_trim)

library(ggfortify)
autoplot(model_income)

Intrepret:

# inspect the high leverage / influencing points
housing_prices_trim %>% 
  ggplot(aes(x = median_income, y = median_house_value)) +
  geom_point(fill = "grey60", size = 0.5) +
  geom_smooth(method = lm) +
  geom_text(aes(label = 1:nrow(housing_prices)), nudge_x = 0.5, nudge_y = 1, size = 2)

Points 11331, 17556 and 1555 appear to be the points of interest.

housing_prices[c(11331,17556,1555),]

Raw data doesn’t look suspect, not an obvious processing error. But note the population looks tiny, very few households. Might want to consider filtering for data where households is sizeable enough…

housing_prices %>% 
  ggplot(aes(x = population)) +
  geom_histogram(bins = 100, colour = "white")


housing_prices %>% 
  ggplot(aes(x = households)) +
  geom_histogram(bins = 100, colour = "white")

It doesn’t make sense to filter out low number households or population, given the distributions here - they don’t seem to be odd.

Try removing a couple of points:

Removing these values has not helped the model, it just shows even more extreme values (high income, low house price). So do not remove these points (continue with housing_prices_trim).

summary(model_income)

Call:
lm(formula = median_house_value ~ median_income, data = housing_prices_trim)

Residuals:
    Min      1Q  Median      3Q     Max 
-513966  -51564  -13979   36549  403935 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept)    45457.0     1359.0   33.45   <2e-16 ***
median_income  39987.0      339.9  117.64   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 74870 on 19673 degrees of freedom
Multiple R-squared:  0.4129,    Adjusted R-squared:  0.4129 
F-statistic: 1.384e+04 on 1 and 19673 DF,  p-value: < 2.2e-16

Here, the model suggests a function:

median_house_value (in $) ~= ~40,000*median_income + 45,500

the p-value (<2.2e-16) is very small (much below a 0.05 significance threshold), which indicates it is highly unlikely this non-zero coefficient would happen in a world where the null hypothesis is true (that median_income has no affect on median_house_value), so we can reject the null hypothesis and say there is enough evidence to suggest that income does affect house value.

The standard error is sizeable (~USD 75,000) - we are talking about house prices in the region of USD 300-500,000 with a an IQR of ~ 130 (the QQ range is ~ 120-250k in the boxplot below) so this error is half the IQR so is quite large for this to be a good model.

housing_prices %>% 
  ggplot(aes(x = median_house_value)) +
  geom_boxplot()

The R2 is 0.4129, which is an ok correlation for such a variable outcome (house value).

Overall this simple linear model does not meet several assumptions of linear regression, so we should not accept that this model is a good fit for our data. We need to improve it - we can try adding another predictor variable in…

7 add another var to model

Add another predictor of your choice. Check your assumptions, diagnostics, and interpret the model.

7a + island

Let’s try with ocean proximity: island.

Considering it as an independent variable, but not using the other ocean_proximities – not sure if this is a correct approach for non-binary categorical variables…

Oh this has not improved the model…

  • RvF - still got a wedge shape, so not linear
  • QQ - residuals still not normally distributed
  • S-L still got a wedge here too, so heterskedastic
  • RvL - shows many points are near the centroid, a couple of points are very far

This distribution doesn’t look so bad though - maybe it is shifted left?

summary(model_income_island)

Call:
lm(formula = median_house_value ~ median_income + ocean_proximity_island, 
    data = housing_prices_trim)

Residuals:
    Min      1Q  Median      3Q     Max 
-514154  -51494  -13940   36548  404045 

Coefficients:
                       Estimate Std. Error t value Pr(>|t|)    
(Intercept)             45320.1     1357.6  33.382  < 2e-16 ***
median_income           40008.7      339.6 117.828  < 2e-16 ***
ocean_proximity_island 225319.3    33449.9   6.736 1.67e-11 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 74780 on 19672 degrees of freedom
Multiple R-squared:  0.4143,    Adjusted R-squared:  0.4142 
F-statistic:  6958 on 2 and 19672 DF,  p-value: < 2.2e-16
mosaic::plotModel(model_income_island)

I would expect to have seen parallel slopes here (one for island = TRUE, another for island = FALSE) - but there is only one line.

housing_prices_trim %>% 
  filter(ocean_proximity_island == 1)

There are only 5 observations with island == TRUE so cannot use this in our model!

Try again…

7b + house_size / occupancy

Let’s derive mean house_size using total_rooms/households

housing_prices_trim <- housing_prices_trim %>% 
  mutate(mean_house_size = total_rooms / households)

head(housing_prices_trim)
housing_prices_trim %>% 
  ggplot(aes(x = mean_house_size, y = median_house_value)) +
  geom_point(size = 0.5)

This is a no-goer - looks like majority of houses are around the same size, with a few extreme values…

What about demand, some kind of measure of households by population? Derive avg_occupancy

housing_prices_trim <- housing_prices_trim %>% 
  select(-mean_house_size) %>% 
  mutate(avg_occupancy = population / households) # indicates # people per household

head(housing_prices_trim)
housing_prices_trim %>% 
  ggplot(aes(x = avg_occupancy, y = median_house_value)) +
  geom_point(size = 0.5)

Now there are 7 points that look like extreme values… could we filter for avg_occupancy < 100??

This may be over-wrangling the data, but these few values do not make sense for a model looking at ~20,000 observations

nrow(housing_prices_trim)
[1] 19675
# > 100 filters 4 out
# > 50 filters 7 out
housing_prices_rm_high_occupancy <- housing_prices_trim %>% 
  filter(!avg_occupancy > 50)
housing_prices_rm_high_occupancy %>% 
  ggplot(aes(x = avg_occupancy, y = median_house_value)) +
  geom_point(size = 0.5)

Now we can see scatter of data, but still does not look to have any linear relationship. Could try scaling but I don’t think that is the problem here…

7c + ocean_proximity

Using house_prices_ocean for our dummy variable ocean_proximal:

housing_prices_ocean %>% 
  ggplot(aes(x = ocean_proximal, y = median_house_value)) +
  geom_boxplot()

There looks to be an effect of ocean proximity with house prices, where being close to the ocean has higher house prices, but there’s also a long tail of values for FALSE. Let’s try including this in our model as a second explanatory variable:

model_income_ocean <- lm(median_house_value ~ median_income + ocean_proximal, 
                         housing_prices_ocean)

This gives us 2 parallel lines - it looks as though being close to the ocean shifts median house_value higher.

Let’s diagnose this model:

  1. RvF - still a wedge shape, so not linear
  2. QQ plot does not suggest normally distributed at the extremes
  3. S-L - still a wedge shape so heterskedastic
  4. there may still be a couple of points that are influencing the model here

So the model does not match our assumptions of linear regression.

Let’s look at the model anyway…

summary(model_income_ocean)

Call:
lm(formula = median_house_value ~ median_income + ocean_proximal, 
    data = housing_prices_ocean)

Residuals:
    Min      1Q  Median      3Q     Max 
-482054  -42181  -11818   27998  376321 

Coefficients:
                   Estimate Std. Error t value Pr(>|t|)    
(Intercept)         12000.4     1268.3   9.462   <2e-16 ***
median_income       34888.2      305.4 114.237   <2e-16 ***
ocean_proximalTRUE  78026.8     1018.6  76.599   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 65630 on 19667 degrees of freedom
Multiple R-squared:  0.5485,    Adjusted R-squared:  0.5485 
F-statistic: 1.195e+04 on 2 and 19667 DF,  p-value: < 2.2e-16

This model (if it were a good fit) suggests base median house value is USD 12,000 and increases with median income (multiplier effect, *35,000), and is also shifted up if the location is close to the ocean (by USD 78,000).

The R2 indicates the model is a good fit, and the p-value is way below a signficance threshold of 0.05, suggesting we can reject the null hypothesis (that these explanatory variables do not predict house value).

This model summary would look good, except that looking at autoplot indicates the linear model is not appropriate here.

To try: standardise or scale the variables somehow? But this won’t change the distribution of each (i.e. won’t shift the long tail of high house prices or high income…)

---
title: "wk10d2 homework"
output: html_notebook
---

```{r}
library(tidyverse)
library(janitor)
```

# 1 load & explore data

> Load the housing_prices.csv data set and undertake an initial exploration of the data. You will find details on the data set on the relevant [Kaggle page](https://www.kaggle.com/datasets/camnugent/california-housing-prices) 

```{r}
housing_prices <- read_csv("data/housing_prices.csv")
```

```{r}
glimpse(housing_prices)
```

Dataset about areas - population, households, features of these. One character column (ocean_proximity), rest are numeric: location, local population size and number of households, number of rooms and bedrooms in the population's households, median age (of the population from 1990 census), median income, median house value.

```{r}
housing_prices %>% 
  summarise(across(everything(),
                   ~sum(is.na(.x))))
```

Only var with missing values is total_bedrooms, 200 NAs out of the 19,675 observations. Tbc how to deal with these, if at all.

# 2 check for explanatory vars that correlate

> We expect the total_rooms of houses to be strongly correlated with total_bedrooms. Use ggpairs() to investigate correlations between these two variables.

It's overkill to look at all the vars, so just look at the two of interest:

```{r, message = FALSE, warning = FALSE}
housing_prices %>% 
  select(total_rooms, total_bedrooms) %>% 
  GGally::ggpairs()
```

They look strongly positively correlated, and corr = 0.934 so that indicates a very strong positive correlation. (And *** indicate this is statistically significant, i.e. we can reject the null hypothesis that total_bedrooms does not affect total_rooms; it is very unlikely this data is from a world where the null hypothesis is true.)

# 3 trim data

> drop total_bedrooms from the dataset, and use only total_rooms going forward

Because these variables are related, we should only use one of these in our model.

```{r}
housing_prices <- housing_prices %>% 
  select(-total_bedrooms)
```

total_bedrooms was the only var with missing values, so we also don't need to worry about these anymore.

# 4 explore numeric predictors

> We are interested in developing a regression model for the median_house_value of a house in terms of the possible predictor variables in the dataset.
i. Use ggpairs() to investigate correlations between median_house_value and the predictors (this may take a while to run, don’t worry, make coffee or something).

```{r, message = FALSE, warning = FALSE}
GGally::ggpairs(housing_prices)
```

From these, it looks like there are otherexplanatory variables that are strongly related:

* population, household and total_rooms are all correlated - they could be combined to create a derived variable as some kind of indicator of housing supply/demand, e.g. households * total_rooms / population (where > 1 is plenty of supply, < 1 is not enough supply, so some demand pressure)
* long and lat are two parts to a location, and are strongly related - they should be combined to indicate individual locations (or areas) in California (where this data is about)

Taking these into account, and looking at the magnitude of the correlation coefficients that are significant (*s), I would consider including the following in a model, in order of steps:

1. income
2. location (lat, long) - look at lat below
3. demand (something derived from pop, household, rooms) - look at rooms below
  - note: rooms is also an indicator of hoiuse size, could be useful on its own
4. age

> ii. Perform further ggplot visualisations of any significant correlations you find.

```{r}
# income
housing_prices %>% 
  ggplot(aes(x = median_income, y = median_house_value)) +
  geom_point(fill = "grey60", size = 0.5) +
  geom_smooth(method = lm)
```

```{r}
cor(housing_prices$median_house_value, housing_prices$median_income)
```

**Interpret:** There is a strong positive correlation between median house value and median income of an area. 64% of variation in house value is explained by variation in income.

This could be a useful explanatory variable.

Note also there seems to be an upper limit in median_house_value, odd lack of values above 500,000.


```{r}
# latitude
housing_prices %>% 
  ggplot(aes(x = latitude, y = median_house_value)) +
  geom_point(fill = "grey60") +
  geom_smooth(method = lm)
```

This is not a linear relationship - ther appear to be clusters where house price is more variable (higher peaks) - i.e. certain latitudes, I suspect these are where cities are.

Do not use this as a variable on its own (consider using location though - maybe even encoding as different areas?)

```{r}
# total_rooms
housing_prices %>% 
  ggplot(aes(x = total_rooms, y = median_house_value)) +
  geom_point(fill = "grey60") +
  geom_smooth(method = lm)
```

This does not look homoeskedastic, lots of data points at lower total_rooms, but with high degree of spread of house values. Suggest total_rooms might need to be standardised by some factor?

```{r}
# median_age
housing_prices %>% 
  ggplot(aes(x = housing_median_age, y = median_house_value)) +
  geom_point(fill = "grey60") +
  geom_smooth(method = lm)
```

No obvious correlation here. Do not use in model.

# 5 explore categorical predictor

> Shortly we may try a regression model to fit the categorical predictor ocean_proximity. Investigate the level of ocean_proximity predictors. How many dummy variables do you expect to get from it?

```{r}
housing_prices %>% 
  distinct(ocean_proximity)
```

There are 5 distinct categories in ocean_proximity, so expect to generate 4 dummy variables. **Set inland as the reference**, everything else gets more proximal to ocean.

```{r}
housing_prices_trim <- housing_prices %>% 
  # set up dummy variables for ocean_proximity
  fastDummies::dummy_cols(select_columns = "ocean_proximity",
                          remove_selected_columns = TRUE) %>% 
  janitor::clean_names() %>% 
  # set inland as reference
  select(-ocean_proximity_inland)
```

```{r}
# check for any potential correlations
housing_prices %>% 
  ggplot(aes(x = ocean_proximity, y = median_house_value)) +
  geom_boxplot()
```

House prices on island look higher than the rest, this may be a sensible predictor to use in a model.


**Update:** from boxplot, actually looks like 3 categories that are similar: inland, island, and proximal to ocean (combining the 3 others) --> which would suggest 2 dummy variables (with inland set as reference).

```{r}
housing_prices %>% 
  summarise(count = n(), .by = ocean_proximity)
```

~20,000 observations of which, 8,600 are <1h to ocean, 5 are on island, 2000 are near bay, 2446 are near ocean, rest are inland (~6,500).

There are only 5 observations for island, so not enough data here to model this category, and also dragging up house prices - suggest dropping this category altogether for now.

So let's use inland as reference, drop island observations, and combine the rest into a predictor "ocean_proximal":

```{r}
housing_prices_ocean <- housing_prices %>% 
  filter(ocean_proximity != "ISLAND") %>% 
  mutate(ocean_proximal = if_else(
    ocean_proximity == "INLAND",
    FALSE, TRUE)) %>% 
  select(-ocean_proximity)

head(housing_prices_ocean)
```

### homework review

In R, we don't need to manually make the dummy variables - just keep the ocean_proximity column and pass this into the model below. Similar results, see the coefficient estimates for each type of ocean_proximity.

Example: 

```{r}
# iris dataset, 1 continuous, 1 categorical var
lm1 <- lm(Petal.Length ~ Petal.Width + Species, iris)

head(model.matrix(lm1))
# shows that the Species var are dummy encoding automatically (3 species, missing the first species in alphabetical order):
```

Also: it's ok to keep the island in with 5 observations; the difference in sample size is not a problem. It might mean we are less empowered to make predictions about island house values later, but it's ok to include in the model. Also ok to combine similar groups.

If you include n dummy variables, you'll get one of them with a line of NAs in summary(model) <- a good sign of multicolinearity and you need to remove 1 dummy var.

# 6 simple linear regression with income

> Start with simple linear regression. Regress median_house_value on median_income and check the regression diagnostics.

```{r}
model_income <- lm(formula = median_house_value ~ median_income, housing_prices_trim)
```

```{r}
# as above
mosaic::plotModel(model_income)
```

```{r}
library(ggfortify)
autoplot(model_income)
```

Intrepret:

* RvF plot has a big wedge shape, which suggests it is heterskedastic - we need to add another variable to explain the variation more consistently across x values and generate a good linear model here
* QQ plot is not linear, the residuals are not normally distrubted, worse at the extremes, look negatively skewed (right skewed) in the histogram below
* Scale-Location - looks to be some potential outliers, points 11331 and 1755.?. (cannot expand plot to see)
* These points and another point (1565..?) have high leverage and look to be influencing in the RvL plot

```{r}
# inspect distribution of the residuals
hist(model_income$residuals)
```

```{r}
# inspect the high leverage / influencing points
housing_prices_trim %>% 
  ggplot(aes(x = median_income, y = median_house_value)) +
  geom_point(fill = "grey60", size = 0.5) +
  geom_smooth(method = lm) +
  geom_text(aes(label = 1:nrow(housing_prices)), nudge_x = 0.5, nudge_y = 1, size = 2)
```

Points 11331, 17556 and 1555 appear to be the points of interest.

```{r}
housing_prices[c(11331,17556,1555),]
```

Raw data doesn't look suspect, not an obvious processing error. But note the population looks tiny, very few households. Might want to consider filtering for data where households is sizeable enough...

```{r}
housing_prices %>% 
  ggplot(aes(x = population)) +
  geom_histogram(bins = 100, colour = "white")

housing_prices %>% 
  ggplot(aes(x = households)) +
  geom_histogram(bins = 100, colour = "white")
```

It doesn't make sense to filter out low number households or population, given the distributions here - they don't seem to be odd.

Try removing a couple of points:

```{r}
# remove the two most concerning points
housing_prices_rm <- housing_prices_trim %>% 
  slice(-c(11331,17556))

model_income_rm <- lm(median_house_value ~ median_income, housing_prices_rm)

autoplot(model_income_rm)
```

Removing these values has not helped the model, it just shows even more extreme values (high income, low house price). So do not remove these points (continue with `housing_prices_trim`).

```{r}
summary(model_income)
```

Here, the model suggests a function:

median_house_value (in $) ~= ~40,000*median_income + 45,500

the p-value (<2.2e-16) is very small (much below a 0.05 significance threshold), which indicates it is highly unlikely this non-zero coefficient would happen in a world where the null hypothesis is true (that median_income has no affect on median_house_value), so we can reject the null hypothesis and say there is enough evidence to suggest that income does affect house value.

The standard error is sizeable (~USD 75,000) - we are talking about house prices in the region of USD 300-500,000 with a an IQR of ~ 130 (the QQ range is ~ 120-250k in the boxplot below) so this error is half the IQR so is quite large for this to be a good model.

```{r}
housing_prices %>% 
  ggplot(aes(x = median_house_value)) +
  geom_boxplot()
```

The R2 is 0.4129, which is an ok correlation for such a variable outcome (house value).

**Overall** this simple linear model does not meet several assumptions of linear regression, so we should not accept that this model is a good fit for our data. We need to improve it - we can try adding another predictor variable in...

# 7 add another var to model

> Add another predictor of your choice. Check your assumptions, diagnostics, and interpret the model.

## 7a + island

Let's try with ocean proximity: island. 

Considering it as an independent variable, but not using the other ocean_proximities -- not sure if this is a correct approach for non-binary categorical variables...

```{r}
model_income_island <- lm(median_house_value ~ median_income + ocean_proximity_island, housing_prices_trim)
```

```{r}
autoplot(model_income_island)
```

Oh this has not improved the model... 

* RvF - still got a wedge shape, so not linear
* QQ - residuals still not normally distributed
* S-L still got a wedge here too, so heterskedastic
* RvL - shows many points are near the centroid, a couple of points are very far

```{r}
hist(model_income_island$residuals)
```

This distribution doesn't look so bad though - maybe it is shifted left?

```{r}
summary(model_income_island)
```

```{r}
mosaic::plotModel(model_income_island)
```

I would expect to have seen parallel slopes here (one for island = TRUE, another for island = FALSE) - but there is only one line.

```{r}
housing_prices_trim %>% 
  filter(ocean_proximity_island == 1)
```
There are only 5 observations with island == TRUE so cannot use this in our model!

Try again...

## 7b + house_size / occupancy

Let's derive mean house_size using total_rooms/households

```{r}
housing_prices_trim <- housing_prices_trim %>% 
  mutate(mean_house_size = total_rooms / households)

head(housing_prices_trim)
```

```{r}
housing_prices_trim %>% 
  ggplot(aes(x = mean_house_size, y = median_house_value)) +
  geom_point(size = 0.5)
```

This is a no-goer - looks like majority of houses are around the same size, with a few extreme values...

What about demand, some kind of measure of households by population? Derive avg_occupancy

```{r}
housing_prices_trim <- housing_prices_trim %>% 
  select(-mean_house_size) %>% 
  mutate(avg_occupancy = population / households) # indicates # people per household

head(housing_prices_trim)
```

```{r}
housing_prices_trim %>% 
  ggplot(aes(x = avg_occupancy, y = median_house_value)) +
  geom_point(size = 0.5)
```

Now there are 7 points that look like extreme values... could we filter for avg_occupancy < 100??

This may be over-wrangling the data, but these few values do not make sense for a model looking at ~20,000 observations

```{r}
nrow(housing_prices_trim)
```

```{r}
# > 100 filters 4 out
# > 50 filters 7 out
housing_prices_rm_high_occupancy <- housing_prices_trim %>% 
  filter(!avg_occupancy > 50)
```

```{r}
housing_prices_rm_high_occupancy %>% 
  ggplot(aes(x = avg_occupancy, y = median_house_value)) +
  geom_point(size = 0.5)
```

Now we can see scatter of data, but still does not look to have any linear relationship. Could try scaling but I don't think that is the problem here...

## 7c + ocean_proximity

Using `house_prices_ocean` for our dummy variable ocean_proximal:

```{r}
housing_prices_ocean %>% 
  ggplot(aes(x = ocean_proximal, y = median_house_value)) +
  geom_boxplot()
```

There looks to be an effect of ocean proximity with house prices, where being close to the ocean has higher house prices, but there's also a long tail of values for FALSE. Let's try including this in our model as a second explanatory variable:

```{r}
model_income_ocean <- lm(median_house_value ~ median_income + ocean_proximal, 
                         housing_prices_ocean)
```

```{r}
mosaic::plotModel(model_income_ocean)
```

This gives us 2 parallel lines - it looks as though being close to the ocean shifts median house_value higher.

Let's diagnose this model:

```{r}
autoplot(model_income_ocean)
```

1. RvF - still a wedge shape, so not linear
2. QQ plot does not suggest normally distributed at the extremes
3. S-L - still a wedge shape so heterskedastic
4. there may still be a couple of points that are influencing the model here

So the model does not match our assumptions of linear regression.

Let's look at the model anyway... 

```{r}
summary(model_income_ocean)
```

This model (if it were a good fit) suggests base median house value is USD 12,000 and increases with median income (multiplier effect, *35,000), and is also shifted up if the location is close to the ocean (by USD 78,000).

The R2 indicates the model is a good fit, and the p-value is way below a signficance threshold of 0.05, suggesting we can reject the null hypothesis (that these explanatory variables do not predict house value).

This model summary would look good, except that looking at autoplot indicates the linear model is not appropriate here.

To try: standardise or scale the variables somehow? But this won't change the distribution of each (i.e. won't shift the long tail of high house prices or high income...)
